Exploratory Analysis

Reviewing and Summarizing Data

A good first step is to review the data we will be working with. We should know the names of the variables in our data, the shape they are currently in, and some basic summary statistics.

names(WineData)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
str(WineData)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
summary(WineData)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Summary of Data

From the output we can see there are 1599 observations across 13 variables, though one variable, ‘X’, is simply a unique identifier for our entries. Most of the variables are continuous; the exception is quality, which is discrete (as is rating, a variable derived from quality later in this analysis). This makes sense given that quality is typically measured on something like a Likert scale. From the variable descriptions, it appears that fixed.acidity and volatile.acidity, as well as free.sulfur.dioxide and total.sulfur.dioxide, may be dependent, with one possibly a subset of the other.

The focus of this analysis is on the factors contributing to wine quality. Since we’re primarily interested in quality, we should look more closely at what its summary and makeup tell us so far.

Some initial observations here:

  • From the literature, quality was measured on a 0-10 scale and was rated by at least 3 wine experts. The values ranged only from 3 to 8, with a mean of 5.6 and a median of 6.
  • All other variables seem to be continuous quantities (with the exception of the .sulfur.dioxide suffixes).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
##  Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
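The calls that produced the output above are not shown, so the following is only a sketch of one way to get it. To keep it runnable on its own, the quality scores are rebuilt from the frequency table printed above; `quality` and `quality.ordered` are illustrative names rather than the ones used in the original analysis.

```r
# Quality scores rebuilt from the frequency table shown above
# (a stand-in for WineData$quality so the snippet runs on its own).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

summary(quality)   # minimum, quartiles, median, mean, maximum
table(quality)     # number of wines at each score

# Treat quality as an ordered categorical variable rather than an integer.
quality.ordered <- factor(quality, ordered = TRUE)
str(quality.ordered)
```

Converting to an ordered factor keeps the ranking of the scores while making it explicit that only six distinct values occur.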

Univariate Plots Section

To first explore this data visually, I’ll draw up quick histograms of all 12 variables to get a better idea as to the shape of our data. The intention here is to see a quick distribution of the values.
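The plotting code is not included in this report. As a rough sketch only, histograms for several variables can be drawn in a grid with base R. The data frame `wine.sample` below is a hypothetical stand-in that holds the first ten rows printed by str(WineData) for four of the variables; with the full data one would loop over the columns of WineData instead.

```r
# First ten rows as printed by str(WineData) above; a stand-in so the
# snippet runs without the full data set.
wine.sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Bin each column without drawing, so the counts can be inspected first.
histograms <- lapply(wine.sample, hist, plot = FALSE)

# Draw the histograms in a 2 x 2 grid, one panel per variable.
old.par <- par(mfrow = c(2, 2))
for (v in names(histograms)) {
  plot(histograms[[v]], main = v, xlab = v)
}
par(old.par)
```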

Univariate Analysis Section

Contributing Factors

Only a few of the variables, density and pH, appear to be normally distributed. Fixed acidity and volatile acidity appear to be somewhat bimodal. citric.acid and free.sulfur.dioxide appear to have a plateau distribution, while chlorides and total.sulfur.dioxide are right-skewed, with a long tail toward high values. While we could play around with bin sizes until the histograms look more normal, that would misrepresent the data and is not something we want to do in this exploratory phase.

Wine Quality

Although wine quality has a discrete range of only 3-8, the counts look roughly normally distributed. A large majority of the wines examined received ratings of 5 or 6, and very few received 3, 4, or 8.

##    3   4   5   6   7   8
##                         
##   10  53 681 638 199  18

Given the ratings and distribution of wine quality, I’ll instantiate another categorical variable, classifying the wines as ‘poor’ (rating 0 to 4), ‘average’ (rating 5 or 6), and ‘good’ (rating 7 to 10).

##    poor average    good 
##      63    1319     217
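One way to build such a variable is with cut(). This is a sketch under assumptions: the exact call used in the analysis is not shown, and the quality scores are rebuilt here from the frequency table above so the snippet runs on its own.

```r
# Quality scores rebuilt from the frequency table shown above.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Bin the 0-10 scale into three ordered classes:
# [0, 4] -> poor, (4, 6] -> average, (6, 10] -> good.
rating <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c("poor", "average", "good"),
              include.lowest = TRUE,
              ordered_result = TRUE)

table(rating)   # poor 63, average 1319, good 217
```

Setting include.lowest = TRUE places a score of 0 in the ‘poor’ class, and ordered_result = TRUE keeps poor < average < good for later comparisons.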

Distributions and Outliers

  • It appears that density and pH are normally distributed, with few outliers.
  • Fixed and volatile acidity, sulfur dioxides, sulphates, and alcohol seem to be long-tailed.
  • Residual sugar and chlorides appear to have outliers, though a histogram isn’t the best way to visualize them. We will examine outliers using a box and whisker plot later on in the analysis.
  • Citric acid appears to have a large number of zero values.

When plotted on a base 10 logarithmic scale, fixed.acidity and volatile.acidity appear to be normally distributed. This makes sense, considering that pH is normally distributed and that pH, by definition, is a measure of acidity on a logarithmic scale. However, citric.acid did not appear to be normally distributed on a logarithmic scale. Upon further investigation:

## [1] 132

The initial plot for citric.acid appears to have a large number of observations with a value of zero. Counting them exactly gives 132 observations. This raises the concern of whether these 132 values were actually measured or simply not reported, considering that the next ‘bin’ higher contains only 32 observations.
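A count like this can be obtained by summing a logical comparison. The sketch below uses only the ten citric.acid values printed by str(WineData), held in the hypothetical vector `citric.sample`, so it reports 5 rather than the 132 found in the full data. It also shows why zeros matter on a logarithmic scale: log10(0) is -Inf, so those observations cannot be placed on a log10 axis.

```r
# The ten citric.acid values printed by str(WineData) above.
citric.sample <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# TRUE counts as 1, so the sum is the number of zero values.
zero.count <- sum(citric.sample == 0)
zero.count

# Zeros become -Inf after a log10 transform and fall off a log-scaled plot.
log.citric <- log10(citric.sample)
sum(!is.finite(log.citric))
```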

Short questions

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Given that the number of factors is relatively small, examining all of them is not out of the question in exploring their relationship to rating. Doing so will help to narrow down which factors impact rating.

Did you create any new variables from existing variables in the dataset?

I instantiated an ordered factor, rating, classifying each wine sample as ‘poor’, ‘average’, or ‘good’.

Upon further examination of the data set documentation, it appears that fixed.acidity and volatile.acidity are different types of acid: tartaric acid and acetic acid, respectively. I decided to create a combined variable, TAC.acidity, containing the sum of tartaric, acetic, and citric acid.
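As described, the combined variable is a row-wise sum of the three acid columns. A minimal sketch, using the first four rows printed by str(WineData) in a hypothetical data frame `wine.sample`:

```r
# First four rows of the three acid columns, as printed by str(WineData).
wine.sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28),
  citric.acid      = c(0, 0, 0.04, 0.56)
)

# Sum of tartaric (fixed), acetic (volatile), and citric acid per wine.
wine.sample$TAC.acidity <- wine.sample$fixed.acidity +
  wine.sample$volatile.acidity +
  wine.sample$citric.acid

wine.sample$TAC.acidity
```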

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I addressed the distributions in the ‘Distributions’ section. But as I mentioned above, boxplots are better suited to visualizing the outliers.

Bivariate boxplots, with rating or quality on the x-axis, will be more interesting in showing trends with wine quality.

Bivariate Plots and Analysis

To get a quick snapshot of how the variables affect quality, I generated box plots for each.
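The plotting code is not part of this report. As an illustration only, a box plot of one variable grouped by quality can be drawn with base R's formula interface. The data frame `wine.sample` is a hypothetical stand-in holding the first ten rows printed by str(WineData), so it covers only quality scores 5, 6, and 7.

```r
# First ten rows of alcohol and quality, as printed by str(WineData).
wine.sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# One box per quality score; points beyond the whiskers are drawn as outliers.
boxplot(alcohol ~ quality, data = wine.sample,
        xlab = "quality", ylab = "alcohol")

# The same comparison as numbers: median alcohol at each quality score.
median.alcohol <- tapply(wine.sample$alcohol, wine.sample$quality, median)
median.alcohol
```

Repeating the boxplot() call with a different variable on the left of the formula, or with rating in place of quality, gives the remaining panels.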

From exploring these plots, it seems that a ‘good’ wine generally has these trends:

  • higher fixed acidity (tartaric acid) and citric acid, lower volatile acidity (acetic acid)
  • lower pH (i.e. more acidic)
  • higher sulphates
  • higher alcohol
  • to a lesser extent, lower chlorides and lower density

Residual sugar and sulfur dioxides did not seem to have a dramatic impact on the quality or rating of the wines. Interestingly, it appears that different types of acid affect wine quality differently; as such, TAC.acidity saw an attenuated trend, as the presence of volatile (acetic) acid accompanied decreased quality.

Using cor.test, I calculated the correlation of each of these variables with quality:

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
##          TAC.acidity log10.residual.sugar      log10.chlordies 
##           0.10375373           0.02353331          -0.17613996 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          -0.05065606          -0.18510029          -0.17491923 
##                   pH      log10.sulphates              alcohol 
##          -0.05773139           0.30864193           0.47616632
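A named vector like the one above can be produced by applying cor.test to each predictor in turn. This is a sketch, not the original code: `wine.sample` and `correlations` are hypothetical names, the data are only the first ten rows printed by str(WineData), and so the values will differ from those computed on all 1599 wines.

```r
# First ten rows of a few columns, as printed by str(WineData).
wine.sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Long-tailed variables can be log-transformed before testing.
wine.sample$log10.sulphates <- log10(wine.sample$sulphates)

# If quality is stored as an ordered factor, convert it back first,
# e.g. as.numeric(as.character(quality)).
predictors <- setdiff(names(wine.sample), "quality")

# Pearson correlation of each predictor with quality.
correlations <- sapply(predictors, function(v) {
  unname(cor.test(wine.sample[[v]], wine.sample$quality)$estimate)
})
correlations
```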

Quantitatively, it appears that the following variables have relatively stronger correlations (in absolute value) with wine quality:

  • alcohol
  • sulphates (log10)
  • citric acid
  • volatile acidity (negative correlation)

Let’s see how these variables compare, plotted against each other and faceted by wine rating to have a better look at the distribution from the scatterplot matrix:

The relative value of these scatterplots is suspect; if anything, they illustrate how heavily alcohol content affects rating. The weakest bivariate relationship appeared to be alcohol vs. citric acid, where the points were nearly uniformly distributed. The strongest relationship appeared to be volatile acidity vs. citric acid, which had a negative correlation.

Examining the acidity variables, I saw strong correlations between them:

## 
##  Pearson's product-moment correlation
## 
## data:  WineData$fixed.acidity and WineData$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

## 
##  Pearson's product-moment correlation
## 
## data:  WineData$volatile.acidity and WineData$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

## 
##  Pearson's product-moment correlation
## 
## data:  log10(WineData$TAC.acidity) and WineData$pH
## t = -39.663, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7283140 -0.6788653
## sample estimates:
##        cor 
## -0.7044435

Most notably, the base 10 logarithm of TAC.acidity correlated very well with pH (r ≈ -0.70). This is certainly expected, as pH is essentially a measure of acidity. An interesting question, using basic chemistry knowledge, is what components other than the measured acids affect pH. We can quantify this by building a linear model that predicts pH from TAC.acidity and capturing the % difference between predicted and observed pH as a new variable.
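The model itself is not shown in this report, so the following is a sketch under assumptions: a simple lm of pH on log10(TAC.acidity), with the relative error defined as (predicted - observed) / observed. The names `wine.sample`, `pH.predicted`, and `pH.error` are hypothetical, and the data are only the first ten rows printed by str(WineData).

```r
# First ten rows of the acid columns and pH, as printed by str(WineData).
wine.sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)
wine.sample$TAC.acidity <- with(wine.sample,
                                fixed.acidity + volatile.acidity + citric.acid)

# Linear model: pH as a function of log10 total measured acidity.
model <- lm(pH ~ log10(TAC.acidity), data = wine.sample)

# Predicted pH and relative error (assumed sign: predicted minus observed).
wine.sample$pH.predicted <- predict(model, newdata = wine.sample)
wine.sample$pH.error <- (wine.sample$pH.predicted - wine.sample$pH) /
  wine.sample$pH

coef(model)
summary(wine.sample$pH.error)
```

A negative error under this convention means the model predicted a lower pH (more acidic) than was measured. Grouping pH.error by quality, for example with a box plot, then shows where the measured acids explain pH poorly.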

The median % error hovered at or near zero for most wine qualities. Notably, wines rated with a quality of 3 had a large negative error. We can interpret this finding by saying that for many of the ‘bad’ wines, total acidity from tartaric, acetic, and citric acids was a worse predictor of pH. Simply put, it is likely that there were other components, possibly impurities, that affected the pH.

As noted previously, I hypothesized that free.sulfur.dioxide and total.sulfur.dioxide were dependent on each other. Plotting this:

## 
##  Pearson's product-moment correlation
## 
## data:  WineData$free.sulfur.dioxide and WineData$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6395786 0.6939740
## sample estimates:
##       cor 
## 0.6676665

There is a strong relationship between the two (r ≈ 0.67), comparable to the fixed.acidity vs. citric.acid correlation and second only to TAC.acidity vs. pH among the tests shown here. Additionally, as the variable names already suggest, the clear ‘floor’ on this graph hints that free.sulfur.dioxide is a subset of total.sulfur.dioxide.
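The subset hypothesis can also be checked directly: if free sulfur dioxide is part of the total, no wine should have a free value above its total. A minimal sketch on the ten rows printed by str(WineData), using the hypothetical vectors `free` and `total`; the same comparison on the full columns would test all 1599 wines.

```r
# First ten rows of both sulfur dioxide columns, as printed by str(WineData).
free  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# TRUE if free never exceeds total, consistent with free being a subset.
free.within.total <- all(free <= total)
free.within.total

# Pearson correlation between the two on this small sample.
sample.cor <- unname(cor.test(free, total)$estimate)
sample.cor
```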

Multivariate Plots

Multivariate Analysis

I primarily examined the 4 features which showed high correlation with quality. These scatterplots were a bit crowded, so I faceted by rating to illustrate the population differences between good wines, average wines, and poor wines. It’s clear that higher citric acid and lower volatile (acetic) acid contribute towards better wines. Likewise, better wines tended to have higher sulphates and alcohol content. Surprisingly, pH had very little visual impact on wine quality, and was overshadowed by the larger impact of alcohol. Interestingly, this shows that what makes a good wine depends on the type of acids that are present.

Final Plots and Summary

Plot 1: Effect of acids on wine quality

These subplots were created to demonstrate the effect of acidity and pH on wine quality. Generally, higher acidity (or lower pH) is seen in highly rated wines. As a caveat, the presence of volatile (acetic) acid negatively affected wine quality. Citric acid had a stronger correlation with wine quality than fixed (tartaric) acid, which had a smaller impact.

Plot 2: Effect of Alcohol on Wine Quality

These boxplots demonstrate the effect of alcohol content on wine quality. Generally, higher alcohol content correlated with higher wine quality. However, as the outliers and intervals show, alcohol content alone did not produce a higher quality.

Plot 3: Factors that impact wine quality?

## `geom_smooth()` using method = 'loess'

This is perhaps the most telling graph. I subsetted the data to remove the ‘average’ wines, or any wine with a rating of 5 or 6. As the correlation tests show, wine quality was affected most strongly by alcohol and volatile acidity. While the boundaries are not as clear-cut or modal, it’s apparent that high volatile acidity, with few exceptions, kept wine quality down. A combination of high alcohol content and low volatile acidity produced better wines.
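The subsetting step can be sketched as follows. This is an illustration under assumptions rather than the original code: `wine.sample` and `extremes` are hypothetical names, the data are the first ten rows printed by str(WineData), and a base R scatterplot stands in for the smoothed plot used in the report.

```r
# First ten rows of three columns, as printed by str(WineData).
wine.sample <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
wine.sample$rating <- cut(wine.sample$quality,
                          breaks = c(0, 4, 6, 10),
                          labels = c("poor", "average", "good"),
                          include.lowest = TRUE,
                          ordered_result = TRUE)

# Keep only the poor and good wines, then drop the unused factor levels.
extremes <- droplevels(subset(wine.sample, rating != "average"))
extremes

# Volatile acidity against alcohol for the remaining wines.
plot(extremes$alcohol, extremes$volatile.acidity,
     xlab = "alcohol", ylab = "volatile.acidity")
```

In this ten-row sample only the two wines rated 7 remain; on the full data the subset would hold the 63 poor and 217 good wines.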

Reflection

This exploratory analysis suggests that certain factors drive wine quality, mainly alcohol content, sulphates, and acidity. One thing to keep in mind is that the data relies on human ratings, which can be extremely subjective. That said, the correlations for these variables are within reasonable bounds. The graphs adequately illustrate the factors that make good wines ‘good’ and poor wines ‘poor’. Further study with inferential statistics could be done to quantitatively confirm these assertions.